Implement column projection #1443
base: main
Conversation
Added a few comments, please take a look! The PR looks great already. Thanks for working on this!
…tion logic to helper method, changed test to use high-level table scan
generally LGTM! I added a few nit comments and some clarifying questions on testing.
thanks for working on this!
tests/io/test_pyarrow.py
partition_spec = PartitionSpec(
    PartitionField(2, 1000, VoidTransform(), "void_partition_id"),
    PartitionField(2, 1001, IdentityTransform(), "partition_id"),
)
I think we'd want to test multiple `IdentityTransform`s here. I'm thinking about a case with multiple levels of hive-style partitioning, e.g. s3://my_table/a=100/b=foo/...parquet. I think `_get_column_projection_values` might not support this right now.
Hmm, got it. I think it is supported with this new commit: before injecting a value into the RecordBatch, it now checks that the field name is present in the schema.
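A minimal sketch of the check described above, with plain dicts standing in for RecordBatches; the function name and shapes are illustrative, not PyIceberg's actual internals:

```python
def inject_partition_values(
    batch: dict[str, list],
    projected_names: set[str],
    partition_values: dict[str, object],
) -> dict[str, list]:
    """Add constant partition columns, skipping names absent from the projection."""
    num_rows = len(next(iter(batch.values()), []))
    out = dict(batch)
    for name, value in partition_values.items():
        # Only inject when the projected schema asks for this name and the
        # data file did not already materialize the column.
        if name in projected_names and name not in out:
            out[name] = [value] * num_rows
    return out
```

Under this sketch, a partition value whose field name is not part of the projection (such as a renamed or dropped partition field) is simply skipped rather than injected.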
Looks like CI caught an interesting case where a new identity partition is added after data files were written. The accessor then cannot find the proper partition record... We need to do something like this
partition_spec = PartitionSpec(
    PartitionField(2, 1000, IdentityTransform(), "void_partition_id"),
)
nit: avoid using `void`, since it's a type of transform: https://iceberg.apache.org/spec/#partition-transforms
partition_id: int64
----
other_field: [["foo"]]
partition_id: [[1]]"""
Shouldn't this project void_partition_id=12 as well?
def test_identity_transform_columns_projection(tmp_path: str, catalog: InMemoryCatalog) -> None:
I was thinking we could test something like 3 fields where 2 are identity partitions, to check the scenario of a multi-level hive partition, for example s3://foo/year=2025/month=06/blah.parquet.
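A hedged sketch of the multi-level hive-style layout this comment describes: each `key=value` path segment contributes one identity partition value. The helper name is illustrative and not part of PyIceberg:

```python
from urllib.parse import unquote


def parse_hive_partitions(path: str) -> dict[str, str]:
    """Extract key=value partition segments from a hive-style file path."""
    values: dict[str, str] = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = unquote(value)
    return values
```

For the example path above, this yields `{"year": "2025", "month": "06"}`, i.e. two identity partition values that a test with two `IdentityTransform` fields would need to project.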
This is a fix for issue #1401, in which table scans needed to infer partition columns by following the column projection rules.

Fixes #1401
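An illustrative, pure-Python sketch (not PyIceberg's implementation) of the column projection rule the PR implements: when a requested field id is missing from the data file but is the source of an identity partition field, its value is taken as a constant from the file's partition record; otherwise the missing column is null-filled.

```python
def project_column(
    field_id: int,
    file_field_ids: set[int],
    identity_partition_sources: dict[int, object],
) -> tuple[bool, object]:
    """Return (found_in_file, constant_value) for a requested field id."""
    if field_id in file_field_ids:
        return True, None  # read the column from the file as usual
    if field_id in identity_partition_sources:
        # Identity-partitioned source column: project the partition value.
        return False, identity_partition_sources[field_id]
    return False, None  # missing and not identity-partitioned: null-fill
```

This also hints at the CI failure discussed above: if an identity partition field is added after a data file was written, that file's partition record has no entry for the new source id, so the lookup must tolerate the missing key rather than fail.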